Domain Adaptation of Natural Language Processing Systems

نویسندگان

  • John Blitzer
  • Alex Kulesza
  • Partha Pratim Talukdar
چکیده

DOMAIN ADAPTATION OF NATURAL LANGUAGE PROCESSING SYSTEMS John Blitzer Fernando Pereira Statistical language processing models are being applied to an ever wider and more varied range of linguistic domains. Collecting and curating training sets for each different domain is prohibitively expensive, and at the same time differences in vocabulary and writing style across domains can cause state-of-the-art supervised models to dramatically increase in error. The first part of this thesis describes structural correspondence learning (SCL), a method for adapting linear discriminative models from resource-rich source domains to resource-poor target domains. The key idea is the use of pivot features which occur frequently and behave similarly in both the source and target domains. SCL builds a shared representation by searching for a low-dimensional feature subspace that allows us to accurately predict the presence or absence of pivot features on unlabeled data. We demonstrate SCL on two text processing problems: sentiment classification of product reviews and part of speech tagging. For both tasks, SCL significantly improves over state of the art supervised models using only unlabeled target data. In the second part of the thesis, we develop a formal framework for analyzing domain adaptation tasks. We first describe a measure of divergence, the H∆H-divergence, that depends on the hypothesis class H from which we estimate our supervised model. We then use this measure to state an upper bound on the true target error of a model trained to minimize a convex combination of empirical source and target errors. The bound characterizes the tradeoff inherent in training on both the large quantity of biased source data and the small quantity of unbiased target data, and we can compute it from finite labeled and unlabeled samples of the source and target distributions under relatively weak assumptions. Finally, we confirm experimentally that the bound corresponds well to empirical target error for the task of sentiment classification. iv COPYRIGHT John Blitzer 2007

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Domain Adaptation for Sequence Labeling Tasks with a Probabilistic Language Adaptation Model

In this paper, we propose to address the problem of domain adaptation for sequence labeling tasks via distributed representation learning by using a log-bilinear language adaptation model. The proposed neural probabilistic language model simultaneously models two different but related data distributions in the source and target domains based on induced distributed representations, which encode ...

متن کامل

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

Cross-Domain and Cross-Language Porting of Shallow Parsing

English was the main focus of attention of the Natural Language Processing (NLP) community for years. As a result, there are significantly more annotated linguistic resources in English than in any other language. Consequently, data-driven tools for automatic text or speech processing are developed mainly for English. Developing similar corpora and tools for other languages is an important issu...

متن کامل

A Wikipedia based Algorithm for Online Adaptation of a Syntactic Parser

Adaptation of models to a new domain is important for many natural language tasks. Because without adaptation, NLP tools trained on news domain achieve sub par results on other domains. In many practical scenarios, the identity of the domain of the test set is unknown. For this difficult but important setting, we propose a novel method of adaptation using entity disambiguation systems to Wikipe...

متن کامل

We're Not in Kansas Anymore: Detecting Domain Changes in Streams

Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention – detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a strea...

متن کامل

Repérage des entités nommées pour l'arabe : adaptation non-supervisée et combinaison de systèmes (Named Entity Recognition for Arabic : Unsupervised adaptation and Systems combination) [in French]

Named Entity Recognition for Arabic : Unsupervised adaptation and Systems combination The recognition of Arabic Named Entities (NE) is a potentially useful preprocessing step for many Natural Language Processing Applications, such as Machine Translation. This task is however made very complex by some peculiarities of the Arabic language. In this paper, we present a summary of our recent efforts...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007